ABSTRACT

We aim to obtain a complete sample of redshift z ≥ 3.6 radio QSOs from FIRST
sources (S_1.4GHz > 1 mJy) having star-like counterparts in the SDSS DR5
photometric survey (r_AB ≤ 20.2). Our starting sample of 8665 FIRST-DR5 pairs
includes 4250 objects with spectra in DR5, 52 of these being z ≥ 3.6 QSOs. We
found that simple supervised neural networks, trained on the sources with DR5
spectra, and using optical photometry and radio data, are very effective for
identifying high-z QSOs in a sample without spectra. For the sources with DR5
spectra the technique yields a completeness (fraction of actual high-z QSOs
classified as such by the neural network) of 96 per cent, and an efficiency
(fraction of objects selected by the neural network as high-z QSOs that
actually are high-z QSOs) of 62 per cent. Applying the trained networks to the
4415 sources without DR5 spectra we found 58 z ≥ 3.6 QSO candidates. We
obtained spectra of 27 of them, and 17 are confirmed as high-z QSOs. Spectra
of 13 additional candidates from the literature and from SDSS DR6 revealed
seven more z ≥ 3.6 QSOs, giving an overall efficiency of 60 per cent (24/40).
None of the non-candidates with spectra from NED or DR6 is a z ≥ 3.6 QSO,
consistent with a high completeness. The initial sample of high-z QSOs is
increased from 52 to 76 sources, i.e. by a factor 1.46. From the new
identifications and candidates we estimate an incompleteness of SDSS for the
spectroscopic classification of FIRST 3.6 ≤ z ≤ 4.6 QSOs of 15 per cent for
r ≤ 20.2.

Key words: methods: data analysis - surveys - quasars: general - galaxies: high 
redshift - early Universe - radio continuum: galaxies 



1 INTRODUCTION 

Homogeneous statistical samples of high-redshift quasi- 
stellar objects (QSOs) allow not only investigation of the 
QSO phenomenon itself, but also provide important infor- 
mation for a wide variety of studies. In particular, the lumi- 
nosity function of high-redshift QSOs provides strong con- 
straints on the theory of the accretion of matter onto su- 
permassive black holes in the nuclei of galaxies. The in- 
creasing evidence for a relation between the formation of 
galaxy bulges and supermassive black holes (Kormendy & 
Richstone 1995, Magorrian et al. 1998) emphasises the im- 
portance of understanding the role of QSO activity in the 
formation and evolution of galaxies. The luminosity function
of QSOs is also essential to quantify their contribution
to the X-ray background and the UV ionising flux at high 
redshift. In addition, the absorption spectra of these QSOs 
reveal the state of the intergalactic medium at early epochs. 

Although radio-loud (RL) QSOs are a small subset of 
the QSO population, samples of high-redshift RL QSOs ben- 
efit from higher completeness, due to the drastically reduced 
contamination by stars in samples of radio selected QSO 
candidates, compared to optically (colour) selected QSO 
candidates (Richards et al. 2006). Moreover, the connection 
between radio and optical activity, which still needs to be 
understood, requires a comparison between radio-loud and 
radio-quiet QSO populations. Ivezić et al. (2004) provide
conclusive evidence that the distribution of radio-to-optical
flux ratio for QSOs, i.e. the radio loudness, is bimodal (the
so-called QSO radio dichotomy), on the basis of accurate
optical and radio measurements of a large sample of RL
QSOs obtained from the Sloan Digital Sky Survey (SDSS; 
York et al. 2000) and the Faint Images of the Radio Sky at 
Twenty cm survey (FIRST; Becker, White & Helfand 1995). 
Many studies suggest that RL QSOs reside in more massive 
galaxies and harbour more massive central black holes than 
radio-quiet QSOs, but the point is still controversial (see 
references for and against these arguments in Cirasuolo et
al. 2006). A recent study by Jiang et al. (2007) based on a 
QSO sample drawn from SDSS and FIRST shows that the 
fraction of RL QSOs decreases with increasing redshift and 
with decreasing optical luminosity. 

We aim to obtain a homogeneous sample of high-redshift RL QSOs (z ≳ 3.6)
drawn from correlation of the FIRST catalogue (S_1.4GHz > 1.0 mJy) with
unresolved objects in the SDSS Data Release 5 (SDSS DR5, Adelman-McCarthy et
al. 2007). The area of overlap between FIRST and the DR5 imaging survey is
~ 7391 deg^2 and the number of selected FIRST-SDSS matches is 8665 (Section
2). SDSS provides: (i) ugriz photometry, which is a powerful tool for
separating high-z QSOs from other populations (e.g. stars, QSOs with z ≲ 3.6
or unresolved low-z galaxies); (ii) morphological classification, essential
for distinguishing between high-z QSOs and galaxies or resolved low-z active
galactic nuclei; and (iii) spectroscopy of many of our candidates (4250),
selected as spectroscopic targets by SDSS DR5. Since SDSS spectroscopic
observations necessarily lag the imaging, the total DR5 spectroscopic area is
lower, with 5553 deg^2 included in the overlap with FIRST. Most of the
candidates with available spectroscopy were classified by SDSS as QSOs, i.e.
have a secure detection of a high-excitation emission line with FWHM ≳ 1000
km sec^-1. The rest are galaxies, stars and objects of 'unknown' class. In
total, 52 DR5 sources were spectroscopically classified as z ≥ 3.6 QSOs
(Table 1).

Our approach to obtaining a high-z QSO sample was to extend the existing
sample of 52 FIRST-DR5 high-z QSOs by applying automated learning techniques,
specifically neural networks (NNs), to the 8665 FIRST-SDSS DR5 photometric
matches. NNs have been shown to be powerful tools for both classification and
regression tasks in many fields of astronomy, and have been applied to predict
object classes and/or astrophysical parameters. Fields where NNs have been
applied include: classification of stellar spectra (Bailer-Jones, Irwin & von
Hippel 1998); morphological star/galaxy separation (e.g. Bertin & Arnouts
1996); morphological classification, spectral typing and/or photometric
redshifts of galaxies [Folkes, Lahav & Maddox 1996; Lahav et al. 1996; Firth,
Lahav & Somerville 2003; Collister & Lahav 2004 (this paper reports the
popular photometric redshift code ANNz); Ball et al. 2004]; QSO identification
and/or QSO photometric redshifts (Carballo et al. 2004; Claeskens et al.
2006); and cross-matching of astronomical catalogues (Rohde et al. 2005).

QSO selection and estimation of QSO photometric redshifts are of prime
importance for the SDSS project. Various studies address the problem using
different machine learning approaches. Richards et al. (2004) applied a
probability density analysis based on kernel density estimation of the colour
distribution of stars and spectroscopically confirmed QSOs in SDSS DR1, to
classify, as stars or QSOs, a catalogue of over 10^5 unresolved, g < 21 mag,
UV-excess (u − g ≤ 1) QSO candidates. The resulting efficiency and
completeness (the latter evaluated for g ≤ 19.5) for the selection of QSOs in
the candidate sample was estimated to be around 95 per cent up to
z ~ 2.4 − 3.0, the redshift limit mainly arising from the restriction of the
catalogue to UV-excess objects.

Suchkov, Hanish and Margon (2005) applied the oblique decision tree classifier
ClassX to classify SDSS-DR2 photometric objects into 25 classes [stars, red
stars, 10 redshift bins for galaxies, and 13 for Active Galactic Nuclei
(hereafter AGN)] using colour information and morphology (attributes
'resolved' or 'unresolved') from SDSS. For each of the 12 redshift bins for
AGN with Δz = 0.2 and covering 0 ≤ z ≤ 2.4, the completeness obtained for the
test sample is in the range from 43 to 81 per cent, with an average of 63 per
cent. For the high-redshift bin, in the range z = 2.4 − 6.0, the completeness
drops to 14 per cent, the remaining high-z AGN being classified as stars (47
per cent) or as AGN in the adjacent redshift bin 2.2 ≤ z ≤ 2.4 (39 per cent).
This result illustrates the difficulty in separating high-z QSOs from other
classes. The efficiency, or fraction of true high-z QSOs among the sources
classified in the AGN high-z bin, was ~ 75 per cent.

Ball et al. (2006) applied decision trees, trained on the SDSS-DR3 objects
with available spectroscopy, to classify all photometric objects (> 10^8) in
SDSS-DR3 in one of the three categories of star, galaxy or nsng (neither star
nor galaxy), the latter including QSOs and 'unknown'. A blind test on the 2dF
QSO Redshift Survey (2QZ; Croom et al. 2004), using the 8739 QSOs matching
2QZ and SDSS-DR3, yielded 95 per cent completeness and 87 per cent
efficiency. The authors do not discuss how the performance depends on
redshift.

Bazell, Miller and SubbaRao (2006) use a semisupervised
mixture model approach to analyse 10000 objects
spectroscopically classified in SDSS-DR4 in the categories of stars,
late-type stars, galaxies and QSOs with 2 3 and unknown, 
using as input data for the modelling SDSS colours and the 
spectroscopic class. Since the aim was to investigate the ex- 
istence of possible new object types among the class of 'un- 
known' as well as subclasses among the remaining classes, 
90 per cent of the sources in the categories of stars, late-type 
stars, galaxies and QSOs were also treated as unknown dur- 
ing the modelling. The best model includes 16 components, 
two of them of the nonpredefined type, and one of the latter
captures a region of the u − g versus g − r colour-colour
diagram (2 ≤ u − g ≤ 5, 0.5 ≤ g − r ≤ 1.5) within the
location of high-z QSOs in Richards et al. (2002), but
intentionally rejected in the QSO selection by Richards et al.
(2004) because of the high density of stellar contaminants 
in that region. 

Gao, Zhang and Zhao (2008) compare the performance of k-dimensional trees and
support vector machines in the separation between stars and QSOs, using a
sample of stars and QSOs spectroscopically classified in SDSS-DR5 and having a
counterpart in the Two-Micron All Sky Survey (2MASS, Cutri et al. 2003). Both
techniques yield a global efficiency and completeness as large as 97 per cent
for 0 ≤ z ≤ 2.5. However, again the accuracy drops significantly for z > 2.5.

Our work deals with the selection of high-z QSOs from
SDSS-FIRST matches. As stated before, our restriction to
radio detected sources drastically reduces the contamination
by stars, enabling us to obtain classification accuracies
at these redshifts better than those obtained in more gen- 
eral studies aimed at the selection of the whole population of 
QSOs, regardless of radio detection. The paper is structured 
as follows. The sample of FIRST-DR5 matches is presented 
in Section 2. In Section 3 we explore the performance of 
supervised NNs to separate high-z QSOs from the remain- 
ing spectral classes in the sample of 4250 sources with DR5 
spectra, using multiband optical photometry and radio data. 
In Sections 4.1 and 4.2, we apply the trained NNs to the 
sample of 4415 sources without DR5 spectra, identifying 58 
high-z QSO candidates. In Section 4.3 we check the reliabil- 
ity of this identification via comparison with spectra from 
the NASA Extragalactic Database (NED), SDSS DR6 and 
follow-up spectroscopy with the WHT. The discussion and 
conclusions are presented in Section 5. 



2 SELECTION OF THE SAMPLE 

As an initial sample we selected all FIRST sources with an unresolved object
in the PhotoPrimary^1 view of the SDSS DR5 Catalog Archive Server (CAS),
within 1.5 arcsec of the radio position, with dereddened psf magnitude
15 ≤ r_AB ≤ 20.2 and 'clean' photometry (i.e. rejecting objects with
magnitude errors > 0.2 in all five bands ugriz, or flagged as 'BRIGHT',
'SATURATED', 'EDGE', 'BLENDED' or 'CHILD'). We selected the r band because
QSOs with redshifts 3.6 ≤ z ≤ 4.5 are expected to have enhanced emission in
this band, due to the Lyα emission line falling within the covered spectral
range. Vigotti et al. (2003) estimated that more than 99 per cent of
FIRST-APM quasars with 3.8 < z ≤ 4.5, E ≤ 18.8 and S_1.4GHz > 1.0 mJy fall
within 1.5 arcsec of the POSS I positions, and this matching radius was
adopted for this work. In total 8665 FIRST sources fulfil the above
requirements. Because of the exclusion of 'CHILD' objects (objects which are
the product of deblending a blended object), in all cases there is one optical
object per radio source. The corrections for Galactic extinction, derived
from Schlegel et al. (1998), were taken from SDSS. Of these sources, 4250 (49
per cent of the sample) have DR5 spectra (specifically, they are included in
the SpecObj^2 view of the DR5 CAS). In fact, the magnitude limit r = 20.2 was
set to ensure an approximately similar fraction of sources with and without
spectra at DR5. The distribution of SDSS spectral types and redshifts for the
objects with DR5 spectra, as quoted in SpecObj, is given in Table 1. The
redshift distribution of the QSOs is shown in Fig. 1.
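
The positional criterion above (one optical counterpart within 1.5 arcsec of the radio position) can be illustrated with a minimal sketch. The function names are hypothetical and the haversine formula is one standard way to compute a small angular separation; the actual selection in this work was made through the SDSS CAS, so this is only an illustration of the criterion, not the query that was run.

```python
import math

def angular_separation_arcsec(ra1_deg, dec1_deg, ra2_deg, dec2_deg):
    """Great-circle separation in arcsec between two sky positions (haversine)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1_deg, dec1_deg, ra2_deg, dec2_deg))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    # min() guards against rounding slightly above 1 for antipodal points
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0

def is_match(radio_pos, optical_pos, radius_arcsec=1.5):
    """True if the optical position lies within the matching radius.

    Positions are (RA, Dec) tuples in degrees; 1.5 arcsec is the radius
    quoted in the text.
    """
    return angular_separation_arcsec(*radio_pos, *optical_pos) <= radius_arcsec
```

For example, `is_match((10.0, 20.0), (10.0, 20.0 + 1.0 / 3600.0))` accepts a counterpart offset by 1 arcsec in declination, while an offset of 2 arcsec is rejected.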

The redshifts of the QSOs with z ≥ 3.6 were checked
by visual inspection of the DR5 spectra. For two of them,
SDSS 130941.36+112540.1 and SDSS 153420.23+413007.5,
we found the redshifts provided by the SDSS pipelines to


^1 Best SDSS observation of the object, and the object is located
within the imaging survey area which has been finished to date.
See http://cas.sdss.org/astrodr5/en/help/docs/tabledesc.asp
^2 This implies that the object was selected for spectroscopy
as an SDSS science object, and that the spectrum was taken on a
main survey plate. See
http://cas.sdss.org/astrodr5/en/help/docs/tabledesc.asp



Table 1. SDSS classification of the 4248 FIRST-DR5 matches
having SDSS spectra

Spectral type      Number
QSO z < 3.6          3754
QSO z ≥ 3.6            52
early-type star       133
late-type star         97
galaxy                 59
unknown               153
Total                4248

Figure 1. Redshift distribution of the 3806 FIRST QSOs with
DR5 spectra.

be likely incorrect, and subsequently the QSOs were identified in the SDSS
DR5 Quasar Catalog (DR5Q; Schneider et al. 2007) with revised redshifts
z = 1.362 and z = 1.400 respectively. These two QSOs were not considered
further and are not included in Table 1 and Fig. 1. On the one hand, the
revised values were published after we had already trained the NNs with the
objects in Table 1 and carried out most of the follow-up observations of the
selected candidates. On the other, having been misidentified as high-z QSOs
in DR5 (redshift confidences 0.59 and 0.71 respectively), their exclusion
from the training sample, used for the learning, seems reasonable. Since our
aim is to obtain a high completeness for high-z QSOs, non high-z sources
whose spectra can be confused with those of high-z QSOs should preferably be
removed from the training sample, since their inclusion could hinder the
selection of the high-z QSOs whose spectra they resemble. We note that the
redshift for SDSS 153420.23+413007.5 at DR6 has been updated to the
high-confidence manual value z = 1.400, but for SDSS 130941.36+112540.1 the
redshift obtained with the spectroscopic pipelines, z = 4.395 (confidence
0.55), has been maintained at DR6.

The spectral types and redshifts of the remaining 4196 sources in Table 1,
most of them QSOs at z < 3.6, were taken directly from DR5, since we did not
expect among them any z ≥ 3.6 QSO with an identification as reliable as that
found for the 52 high-z QSOs. A visual examination of the DR5 spectra of 225
of these sources (the first 125 z < 3.6 QSOs, 25 early-type stars, 25
late-type stars, 25 galaxies, and 25 'unknown', in order of increasing right
ascension) yielded no identification as a likely z ≥ 3.6 QSO. Moreover, since
our approach for the automated selection of high-z QSOs groups the remaining
spectral types as a single class (see Section 3),

Table 2. FIRST-DR5 sources with SDSS spectra, and classified by SDSS as z ≥ 3.6 QSOs

RA           Dec          r_AB   S_1.4GHz  Redshift  NN output  Previous  Notes
J2000        J2000               mJy                 y_med      samples
(1)          (2)          (3)    (4)       (5)       (6)        (7)       (8)

01 53 39.61  -00 11 05.0  18.82      4.75  4.194     1.00
03 00 25.22   00 32 24.2  19.66      7.56  4.201     1.00
07 51 13.05   31 20 37.9  19.75      5.60  3.761     0.11                 abs
07 51 22.35   45 23 34.2  20.20      1.13  3.608     0.89
08 10 09.95   38 47 57.0  19.62     27.16  3.946     0.29
08 38 08.46   53 48 09.8  19.94      8.42  3.610     1.00
08 39 46.22   51 12 02.8  19.33     41.64  4.390     1.00       1,2
08 40 44.18   34 11 01.6  19.79     13.59  3.889     1.00
08 52 57.12   24 31 03.1  19.47    159.58  3.617     0.33
09 18 24.38   06 36 53.3  19.77     26.50  4.192     0.84       2         abs
09 37 14.49   08 28 58.5  18.59      3.17  3.700     0.53
09 40 03.02   51 16 02.7  19.99     13.91  3.601     0.03       3
10 00 12.26   10 21 51.8  19.54     21.93  3.638     0.86
10 17 47.75   34 27 37.8  20.00      2.63  3.691     0.91
10 30 55.95   43 20 37.7  19.84     37.82  3.700     0.75
10 34 46.54   11 02 14.5  18.80      1.09  4.266     0.70
10 51 21.36   61 20 38.0  18.90      6.64  3.689     0.93
10 57 56.28   45 55 53.0  17.44      1.38  4.137     1.00       1,2,3
11 10 55.21   43 05 10.0  18.59      1.21  3.821     0.81       3
11 17 01.89   13 11 15.4  18.28     28.99  3.624     0.14
11 17 36.33   44 56 55.6  20.03     25.08  3.853     1.00
11 25 30.48   57 57 22.7  19.41      2.99  3.685     0.70                 abs
11 27 49.45   05 11 40.6  19.13      2.71  3.711     0.12                 abs
11 29 38.73   13 12 32.2  18.77      1.33  3.607     0.21
11 33 30.91   38 06 38.1  19.71      0.87  3.631     0.80
11 50 45.61   42 40 01.1  19.87      1.51  3.894     0.54
12 04 47.15   33 09 38.7  19.24      0.92  3.616     0.56                 BAL
12 31 42.17   38 16 58.9  20.18     24.04  4.138     0.99
12 40 54.91   54 36 52.2  19.74     15.09  3.938     1.00       3
12 42 09.81   37 20 05.6  19.34    662.38  3.819     0.97
12 46 58.83   12 08 54.7  20.00      1.44  3.805     1.00                 BAL
12 49 43.67   15 27 07.0  19.31      2.01  3.995     0.89
13 00 02.16   01 18 23.0  19.78      2.52  4.614     0.90
13 03 48.94   00 20 10.4  18.90      0.99  3.647     0.99                 BAL
13 07 38.83   15 07 52.1  19.70      3.89  4.082     0.94
13 15 36.57   48 56 29.1  19.77     10.86  3.618     0.96
13 25 12.49   11 23 29.7  19.33     71.05  4.409     1.00       2
13 54 06.89  -02 06 03.2  19.18    719.48  3.715     0.67
13 55 54.55   45 04 21.0  19.34      2.07  4.095     1.00
14 08 50.91   02 05 22.7  19.07      1.18  4.008     0.06                 abs
14 12 09.96   06 24 06.8  20.17     43.47  4.467^a   0.90                 abs
14 22 09.70   46 59 32.5  19.70     11.03  3.798     1.00
14 23 26.48   39 12 26.2  20.15      6.52  3.921     1.00
14 35 33.77   54 35 59.3  20.05     95.78  3.810     0.90
14 45 42.75   49 02 48.9  17.32      3.18  3.876     0.99
14 46 43.36   60 27 14.3  19.79      1.81  3.777     1.00
15 03 28.88   04 19 49.0  17.96    124.97  3.664     0.84
15 06 43.81   53 31 34.3  18.94     14.63  3.790     0.32       3         abs
16 17 16.49   25 02 08.1  19.99      2.35  3.943     0.98                 BAL
16 19 33.65   30 21 15.1  19.52      4.19  3.794     0.92       3
16 39 50.52   43 40 03.7  17.96     25.23  3.990     0.20       1,2,3     abs
16 43 26.24   41 03 43.5  20.10     65.01  3.873     1.00

The columns give: (1,2) SDSS coordinates; (3) SDSS dereddened psf r_AB magnitude; (4) FIRST
total radio flux density; (5) SDSS redshift (a = revised redshift from DR5Q, see Section 2);
(6) NN output (see Section 4.2); (7) labels 1, 2 and 3 indicate QSOs included in the samples of
Benn et al. (2002), Holt et al. (2004) and Carballo et al. (2006) respectively; (8) BAL = broad
absorption line QSO; abs = the Lyα line appears to be self-absorbed.
Table 3. Combinations (A - H) of optical and radio data used as
input variables to the neural network

                          A  B  C  D  E  F  G  H
r magnitude               X  X  X  X  X  X  X  X
u − g                     X  X  X  X  X  X  X  X
g − r                     X  X  X  X  X  X  X  X
r − i                     X  X  X  X  X  X  X  X
i − z                     X  X  X  X  X  X  X  X
rad-opt separation           X        X  X     X
log10[S_total(1.4 GHz)]         X     X     X  X
log10(S_total/S_peak)              X     X  X  X


al. 2003, Carballo et al. 2006). We therefore took this redshift
as an initial threshold for high-z QSOs, although we
explored redshifts below z = 3.6 to find the optimal value
for the optical bands used in this work.

The learning algorithm applied was a feed-forward NN
(Bishop 1995) with a layer for the input parameters, i.e. the
data, and an output layer with a single variable y, set during
training to 1 for high-z QSOs and 0 for the remaining types.
Output y for object i is given by:

$$ y_i = f(a_i) = \frac{1}{1 + e^{-a_i}} $$



revisions from DR5 such as shifts in redshift below this limit 
or changes between the non high-z QSO categories would 
not affect the results. 

The fraction of QSOs with z ≥ 3.6 is 1.37% (52/3806)
and they are listed in Table 2. All these QSOs are included
in DR5Q and only one of them, SDSS 141209.96+062406.8,
with a deep and wide absorption feature bluewards of the
Lyα emission line and starting at the Lyα line, has a revised
redshift at DR5Q, z = 4.467 versus z = 4.421 at DR5.
DR5Q provides interesting complementary information for
these QSOs and for the remaining ones in Table 1, including
i band absolute magnitudes, g − i 'differential colour' with
respect to the typical value for the QSO redshift, and matches
to ROSAT All-Sky Survey (RASS; Voges et al. 1999, 2000)
and 2MASS when available.

The FIRST-SDSS sample was obtained using a simple 
one-to-one match between radio and optical sources (within 
a 1.5 arcsec radius), therefore missing the class of double- 
lobe QSOs without detected radio cores. Using the statistics 
found by de Vries et al. (2006, their table 2) for a sample of 
5515 FIRST-SDSS QSOs with radio morphological informa- 
tion within 450 arcsec, the fraction of FIRST-SDSS double- 
lobe QSOs with undetected cores with respect to the total 
FIRST-SDSS QSO sample is 3.7 per cent. Since the starting 
samples of SDSS QSOs in de Vries et al. (2006) and in this 
work obey similar SDSS selection criteria, the last value is
a good estimate of the incompleteness of the QSO samples 
in our work due to the exclusion of lobe dominated QSOs. 



3 SEPARABILITY OF HIGH-REDSHIFT QSOS 
WITH A NEURAL NETWORK 

Only 52 of the 4248 sources with DR5 spectra are classified 
as z ≥ 3.6 QSOs (Table 1). Our goal is to train a classifier
to recognize high-z QSOs among the 4415 objects without 
SDSS spectra, i.e. the 'unlabelled' sources, after learning the 
class properties from the 4248 objects with spectra, i.e. the 
'labelled' sources. The adopted procedure was to consider a 
two-class problem, with high-z QSOs as one class and the 
remaining types as the other. Since the training uses objects 
whose class is known, the learning is said to be 'supervised'. 

Previous selections of high-z RL QSOs as FIRST 
sources with optical counterparts on POSS-APM revealed 
an abrupt change in O − E colour with redshift at z ~ 3.6,
the latter allowing efficient separation of high-z QSOs from 
the QSO population as a whole (Benn et al. 2002, Vigotti et 



with $a_i = w_0 + \sum_{j=1}^{d} w_j \, x_{ij}$,

where (x_1, x_2, ..., x_d) are the inputs, a is a linear function
of the inputs, and f is the non-linear function known as
a logistic sigmoid, with outputs in the range (0, 1). This
NN model is known as a logistic linear discriminant; w_0 and
(w_1, w_2, ..., w_d), called bias and weights respectively, are
the parameters fitted during training. The adopted error
function was the mean of the squared errors of the outputs.



$$ E = \frac{1}{m} \sum_{i=1}^{m} (y_i - t_i)^2 $$

where m is the number of objects for the training and
t is the desired value of output y or target value. The
optimal parameters for the net, i.e. those minimising
the error, were obtained using the Levenberg-Marquardt
algorithm, available in the MATLAB Neural Network
Toolbox (http://www.mathworks.com/). The Levenberg-Marquardt
algorithm appears to be the fastest method for
training moderate-sized NNs (Hagan & Menhaj 1994), and
its efficient implementation in MATLAB further improves its
performance.
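
As an illustration of the model just described, the following minimal sketch evaluates the logistic linear discriminant for one object and the mean squared error over a set of objects. The function names are hypothetical, and the sketch covers only the forward evaluation: it does not implement the Levenberg-Marquardt fit, which in this work was done with the MATLAB toolbox.

```python
import math

def logistic_output(x, w0, w):
    """NN output y = f(a) with a = w0 + sum_j w_j * x_j (logistic sigmoid)."""
    a = w0 + sum(wj * xj for wj, xj in zip(w, x))
    # Two algebraically equivalent branches avoid overflow in exp()
    if a >= 0.0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def mean_squared_error(outputs, targets):
    """Mean of the squared differences between outputs y and targets t."""
    return sum((y - t) ** 2 for y, t in zip(outputs, targets)) / len(outputs)
```

With all weights set to zero the output is 0.5 for any input, i.e. the untrained discriminant expresses no preference between the two classes.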

As input data we tried various combinations (A - H) of the
variables shown in Table 3. As pre-processing, each variable
was normalised to the range [−1, 1]. For this step
the total sample was used (unlabelled sources included) since
the inputs for the new objects presented to the net need to
have the same normalisation as the data used in the learning
process. The output values, 0 ≤ y ≤ 1, give the degree of
similarity with the class of high-z QSOs. Objects with y
exceeding a given threshold y_c would be classified as high-z
QSO candidates.
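
The normalisation step can be sketched as follows, assuming a simple linear (min-max) mapping onto [−1, 1]; the text does not state the exact mapping used, and the function names are hypothetical. The point illustrated is that the limits are derived once from the total sample (labelled plus unlabelled) and then reused for every object.

```python
def fit_limits(values):
    """Minimum and maximum of one input variable over the total sample."""
    return min(values), max(values)

def scale_to_unit_range(value, lo, hi):
    """Map the interval [lo, hi] linearly onto [-1, 1]."""
    if hi == lo:
        return 0.0  # constant variable: carries no information
    return 2.0 * (value - lo) / (hi - lo) - 1.0

# Hypothetical r magnitudes, for illustration only
labelled_r = [17.3, 19.0, 19.8]
unlabelled_r = [15.0, 20.2]
lo, hi = fit_limits(labelled_r + unlabelled_r)
scaled_labelled = [scale_to_unit_range(r, lo, hi) for r in labelled_r]
scaled_unlabelled = [scale_to_unit_range(r, lo, hi) for r in unlabelled_r]
```

Deriving `lo` and `hi` from the labelled sources alone could map unlabelled objects outside [−1, 1], which is the situation the shared normalisation avoids.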

The quality of an NN for classification is evaluated in
terms of its efficiency, eff, and completeness, comp. Efficiency
is the fraction of sources with y ≥ y_c that actually are high-z
QSOs. Completeness is the fraction of actual high-z QSOs
with y ≥ y_c. A good separation between two classes will
show efficiency increasing and completeness decreasing as y_c
increases. Since our purpose is to build a sample appropriate
for statistical analysis, priority is given to completeness,
accepting low y_c values at the cost of lower efficiency.
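
The two figures of merit defined above can be computed directly from the NN outputs and the known classes. The sketch below uses hypothetical names and returns NaN when a denominator is empty, a convention chosen here for illustration.

```python
def completeness_efficiency(outputs, is_high_z, y_c):
    """Return (comp, eff) for the selection y >= y_c.

    outputs   : NN outputs y, one per object
    is_high_z : True for objects that actually are high-z QSOs
    y_c       : selection threshold
    """
    selected = [y >= y_c for y in outputs]
    n_true = sum(1 for t in is_high_z if t)
    n_selected = sum(1 for s in selected if s)
    n_hits = sum(1 for s, t in zip(selected, is_high_z) if s and t)
    comp = n_hits / n_true if n_true else float("nan")
    eff = n_hits / n_selected if n_selected else float("nan")
    return comp, eff
```

Lowering `y_c` can only add objects to the selection, so completeness never decreases as the threshold is lowered, while efficiency typically falls as more contaminants are admitted.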

The classifier has to be empirically tested using a set 
of objects not used for the learning. Because of the small 
number of high-z QSOs, the training and test samples were
separated adopting the partition method known as 'leave-

Figure 2. Efficiency versus completeness of the NN search for
high-z QSOs, for y_c values 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9
(right to left) and the eight sets of input variables A-H. The
adopted redshift cut for high-z QSOs was z = 3.6. A sample
Poisson error bar is plotted for each set.

Figure 3. Redshift distribution for the 28 z < 3.6 QSOs with
y ≥ 0.1 (i.e. the contaminants), using the best-performing set of
inputs, B.



one-out', repeatedly dividing the data set of m instances
into a training set of size m − 1 and a test set of size 1, in all
possible ways. This procedure yielded m classifiers, one per
test object. Since all m objects (4248) are used for testing,
a good estimate of the performance can be obtained.
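
The leave-one-out partition can be sketched generically as below. The `fit` and `predict` callables are placeholders for any training and evaluation routine; the toy one-dimensional midpoint classifier is purely illustrative and is not the NN used in this work.

```python
def leave_one_out(inputs, targets, fit, predict):
    """Train on m - 1 objects and evaluate on the held-out one, m times."""
    held_out_outputs = []
    for i in range(len(inputs)):
        train_x = inputs[:i] + inputs[i + 1:]
        train_t = targets[:i] + targets[i + 1:]
        model = fit(train_x, train_t)
        held_out_outputs.append(predict(model, inputs[i]))
    return held_out_outputs

# Toy stand-in classifier: threshold halfway between the two class means.
# It assumes both classes remain populated after removing one object.
def fit_midpoint(xs, ts):
    pos = [x for x, t in zip(xs, ts) if t == 1]
    neg = [x for x, t in zip(xs, ts) if t == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def predict_midpoint(threshold, x):
    return 1.0 if x >= threshold else 0.0

xs = [0.0, 0.2, 0.4, 3.0, 3.2, 3.4]
ts = [0, 0, 0, 1, 1, 1]
loo_outputs = leave_one_out(xs, ts, fit_midpoint, predict_midpoint)
```

Each entry of `loo_outputs` comes from a model that never saw the corresponding object, so the list can be passed to a completeness and efficiency calculation as an estimate of performance on unseen data.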

The efficiency and completeness as a function of y_c for
each of the eight sets of input variables are shown in Fig. 2.
For all sets of features the expected trend between eff and
comp is present, and classifiers are found yielding comp ≥

Figure 4. Similar to Fig. 2, but using the set of inputs B and
with z_cut = 3.5. The number of QSOs with higher redshift than
this is 69.



90 per cent and eff ≥ 60 per cent for particular y_c values.
Among these sets of inputs, B and E gave the maximum
completeness, ~ 96 per cent, both yielding ~ 62 per cent
efficiency. B was selected as the best classifier since it uses a
smaller number of input variables. In total, 81 sources have
y ≥ 0.1 for this set, of which 50 are QSOs with z ≥ 3.6,
yielding completeness 50/52 = 96 ± 14 per cent and efficiency
50/81 = 62 ± 9 per cent for y ≥ 0.1. The 31 contaminants
include one star, two galaxies and 28 z < 3.6 QSOs. The
redshift distribution of the latter is shown in Fig. 3. Of
these QSOs, 19 (a fraction 19/31 = 61 ± 14 per cent of the
contaminants) have redshifts 3.2 < z < 3.6, very close to
the selected threshold.
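
The uncertainties quoted above are numerically consistent with a Poisson error on the numerator only, sqrt(k)/n. This is an inference from the quoted values, not a formula stated in the text; the short sketch below, with a hypothetical function name, reproduces the three figures under that assumption.

```python
import math

def fraction_with_poisson_error(k, n):
    """Fraction k/n and its error sqrt(k)/n (Poisson error on k only; assumed)."""
    return k / n, math.sqrt(k) / n

for k, n in [(50, 52), (50, 81), (19, 31)]:
    frac, err = fraction_with_poisson_error(k, n)
    print(f"{k}/{n} = {100 * frac:.0f} ± {100 * err:.0f} per cent")
```

Under this convention the upper error bar on the completeness formally exceeds 100 per cent, which reflects the small number of high-z QSOs rather than a physically meaningful range.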

Using a redshift cut z_cut = 3.5 and the set of inputs B
we found comp = 90 ± 11 per cent and eff = 68 ± 9 per cent
for y_c = 0.1 (Fig. 4). Since completeness was prioritised we
kept z_cut = 3.6, yielding completeness ~ 96 per cent.

The inputs used for the learning were basically the 
SDSS colours, which are known to be correlated, especially 
for QSOs (Weinstein et al. 2004). This means that we could 
have applied some preprocessing algorithms to reduce the 
input space dimension, and therefore improve its sampling. 
In fact, the approach of logistic linear regression does not 
need to assume that the variables are independent, and the 
presence of covariance in the input data does not affect the 
quantification of the parameters of the optimal hyperplane 
separating high-z QSOs from the remaining classes except 
for causing a sampling of the input space lower than neces- 
sary. The good performance found for the test set of unseen 
data gives us confidence that although probably redundant, 
the selected set of input features is appropriate. 

Several works described in the Introduction select QSOs 
from SDSS using colour and/or photometric information 
covering the five bands. Suchkov et al. (2005) apply a
k-dimensional decision tree classifier and use five colours (u − g,
g − r, r − i, i − z and g − i) as input data. Gao et al. (2008)
apply decision trees and support vector machines to select 
QSOs from a combined SDSS-2MASS sample, using several 
sets of input data including in all cases SDSS photometry 
at the five bands (magnitudes in all bands or four colours 
and a magnitude). They found the best results for the in- 
put set with four SDSS colours and the r-band magnitude,
which is the set of optical data selected in our work. Ball 
et al. (2006, 2007) use decision trees to select QSOs from 
SDSS and a k-nearest neighbour instance-based approach
to quantify QSO photometric redshifts (Ball et al. 2008), 
in both cases using as input data four colours in the four 
magnitude types (PSF, fiber, Petrosian and model). Ball et 
al. (2007) applied genetic algorithms to investigate subsets 
of the original 16 inputs in a systematic way and found that 
no subset resulted in a significant improvement and some 
subsets were even worse, therefore electing to keep the full 
SDSS information available. 

Since we adopted the leave-one-out approach and used all the sources for
testing, the training and test objects are drawn from the same sample, i.e.
the sample of sources with DR5 spectra. This means that the quoted figures of
96 ± 14 per cent completeness and 62 ± 9 per cent efficiency refer to the
selection of high-z QSOs among the labelled sources, i.e. within the
interpolation regime.

The labelled sample contains, among the 52 high-z
QSOs, 12 sources with broad absorption lines (BALs) or

Figure 5. Normalized input parameters r, u − g, g − r, r − i, i − z and radio-optical separation for labelled high-z QSOs (o), labelled non
high-z QSOs (•, 8 per cent shown), and unlabelled sources (x, 8 per cent shown). The numbers indicate the minimum and maximum
value of each variable (in magnitudes or arcsec) for the sources in the plot.



self-absorption at Lyα (see Table 2). The classifier recovers 50/52 of the
high-z QSOs and 11/12 of the BALs and QSOs self-absorbed at Lyα, proving to
be effective in the selection of QSOs with this type of absorption feature.
On the other hand, we confirmed that all 52 high-z QSOs in the training
sample have at least one emission line with FWHM ≳ 1000 km sec^-1, i.e. none
of them belongs to the class of "narrow-lined" (type II) QSOs. Therefore the
classifier trained in this work targets the selection of high-z QSOs with at
least one broad emission line, including those cases presenting absorption
features in the form of BALs or Lyα absorption.

Figure 5 shows the distribution of the six variables used
as input data (r, u-g, g-r, r-i, i-z and radio-optical offset),
separating (from left to right) labelled high-z QSOs, labelled
non high-z QSOs and unlabelled sources. For non high-z
QSOs (4196) and unlabelled sources (4415) we used for the
plot representative subsets containing eight per cent of the
sources, to improve visualization. For non high-z QSOs this
proportion was applied separately for QSOs, galaxies, early-
type stars, late-type stars and 'unknown', to keep the same
fractions as in the original sample. Most of the sources in this
sample are QSOs with z < 3.6 (3906/4248 = 90 per cent).
The scale is linear, and although normalized variables were
used, their ranges in physical units can be inferred from the
marked numbers in the figure, showing the minimum and
maximum value of each variable for the represented sources.
Table 4 presents the mean, standard deviation and median
for each variable, for the three complete samples, as well as
for the whole sample of labelled sources. In Fig. 5 and Table
4 all sources were included regardless of their photometric
errors. The errors at g, r, i and z are less than 0.2 mag
(except for one unlabelled source with Δr = 0.24 and eight
with Δz = 0.2-0.3). For the u band the errors exceed 0.2 mag
for all high-z QSOs, 246 labelled non high-z QSOs (out of
4196) and 313 unlabelled sources (out of 4415), with median
values for these sources of Δu = 0.8, 0.5, 0.4 respectively.

Table 4. Statistics (mean, standard deviation and median) of the
selected input variables for various SDSS-DR5 subsamples. For
each variable the first row gives mean ± standard deviation and
the second row gives the median.

           high-z QSOs   non high-z QSOs   lab.          unlab.
           (52)          (4196)            (4248)        (4415)

r          19.36±0.69    18.90±0.86        18.90±0.86    19.33±0.81
           19.53         19.00             19.01         19.56
u-g        3.34±1.20     0.54±0.75         0.57±0.82     0.53±0.70
           3.07          0.31              0.32          0.32
g-r        1.40±0.45     0.27±0.31         0.29±0.33     0.27±0.30
           1.24          0.21              0.21          0.21
r-i        0.17±0.22     0.16±0.21         0.16±0.22     0.15±0.20
           0.13          0.14              0.14          0.13
i-z        0.06±0.18     0.11±0.17         0.11±0.17     0.12±0.18
           0.10          0.09              0.09          0.11
sep (")    0.30±0.20     0.40±0.33         0.40±0.33     0.39±0.32
           0.25          0.29              0.29          0.28
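
The eight per cent subsets plotted in Fig. 5 keep the class fractions of the parent sample, i.e. the proportion is applied class by class. A minimal sketch of such stratified subsampling follows; the function name, the class counts and the random seed are all hypothetical and serve only to illustrate the idea, not to reproduce the authors' procedure.

```python
import random

def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices of a subsample that keeps class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    chosen = []
    for indices in by_class.values():
        # Keep at least one object per class so that no class vanishes.
        n_keep = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, n_keep))
    return sorted(chosen)

# Hypothetical class counts, for illustration only.
toy_counts = {"QSO": 500, "galaxy": 50, "early-type star": 25,
              "late-type star": 25, "unknown": 100}
labels = [name for name, n in toy_counts.items() for _ in range(n)]

subset = stratified_subsample(labels, fraction=0.08)
kept = {name: sum(labels[i] == name for i in subset) for name in toy_counts}
print(kept)
```

Sampling within each class, rather than from the pooled list, is what guarantees that rare classes appear in the plot with the same relative weight as in the full sample.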

Regarding the comparison among labelled sources, the
variables that, taken individually, best discriminate between
the classes of high-z QSOs and non high-z QSOs are
the u-g and g-r colours, as expected from the fact that
z < 3.6 QSOs dominate the non high-z QSO sample and from
the well established colour-redshift relation for QSOs (e.g.
Schneider et al. 2007). Also noticeable is the concentration
of high-z QSOs at the faintest r magnitudes (fainter
magnitude and lower dispersion) and at the lowest radio-optical
offsets (lower separation and again lower dispersion),
compared to the remaining labelled sources.



4 APPLICATION OF THE NETWORK TO 
THE UNLABELLED SAMPLE 

4.1 The labelled and unlabelled samples 

The labelled sample consists of the FIRST-SDSS sources
included in the DR5 spectroscopic catalogue. The content
of the labelled sample is determined by the way the
photometric objects were selected as spectroscopic targets by
SDSS. The selection criteria were mainly aimed at obtaining
samples of galaxies, QSOs and brown dwarfs, with different
combinations of selected parameters (e.g. magnitude and
colour ranges, extension, proximity to catalogued sources at
other wavelengths) being used for each object type. The
labelled sample cannot therefore be considered as statistically
representative of the SDSS imaging database. Classes not
considered in the spectroscopic selection criteria may be
absent or poorly represented in the spectroscopic catalogue.
The unlabelled sample is therefore a mixture of the sources
in the DR5 spectroscopic area not selected as spectroscopic
targets (2059 objects within ~ 5553 deg^2, compared to 4248
labelled in the same region), and the sources in the DR5
photometric area but outside the spectroscopic area (2356
objects within ~ 1838 deg^2).

A general concern about classification is the application
of an algorithm trained on a sample of data to a different
set of data, likely covering other regions of input space. The
extension of the classifier beyond the original sample used
for the training is framed in the context of extrapolation
versus interpolation. In our case the training set is the DR5
spectroscopic survey or labelled sample, and our aim is to
apply the classifier to the sources in the DR5 photometric
sample without DR5 spectra. We expect a reasonable
overlap between labelled sources and the sources located outside
the spectroscopic area, since a large fraction of the latter
would have been SDSS spectroscopic targets if included in
the spectroscopic area [a fraction 4248/(4248+2059) = 67
per cent using the statistics in the DR5 spectroscopic area].
A poorer overlap is expected between labelled sources and
the unlabelled sources in the DR5 spectroscopic area.

Figure 5 and Table 4 allow a comparison of the distribution
of each input variable for the classes of labelled and
unlabelled sources. Although Fig. 5 does not include labelled
sources as a whole, the distributions for this set are
approximately similar to those for labelled non high-z QSOs,
since high-z QSOs make up only 1.2 per cent (52/4248) of the
labelled sources. The main effect of including high-z QSOs
would be to increase the ranges of the u-g and g-r colours.
The statistics of mean, standard deviation and median for
the whole labelled and unlabelled groups are given in
Table 4 (columns 3 and 4). Figure 5 and Table 4 show a good
agreement between the distributions of each of the input
variables for unlabelled and labelled sources, except for the
r band magnitudes, which are fainter for unlabelled sources,
with mean and median of 19.33 and 19.56 versus 18.90 and
19.01. These comparisons between unlabelled and labelled
sources use each input variable individually. More complex
separations between labelled and unlabelled sources would
be expected in the six-dimensional input space.

Figure 6. Distribution of NN output y for the unlabelled FIRST-
DR5 sources and the separation high-z QSO versus remaining
classes, assuming the trained NNs (see Section 4.2). The boxes
mark the median values. The sources to the left of the vertical
line (2356) are located outside the DR5 spectroscopic area, and
those to the right (2059) inside this area.

We applied the trained classifier to the unlabelled
sample in order to select new high-z QSO candidates and to
obtain an estimate of the completeness of SDSS spectroscopy
for radio QSOs with 3.6 ≤ z ≤ 4.6. The sample of new
high-z QSO candidates is presented in Sect. 4.2. In Sect. 4.3 we
discuss the performance obtained in the unlabelled sample,
on the basis of available spectroscopic identifications from
observations in this work, taken from the literature or from
SDSS DR6.



4.2 The sample of high-z QSO candidates 

In Section 3 we demonstrated the good performance of
simple NNs for separating, on the basis of optical photometry
and radio data, high-z QSOs from the remaining classes in
the labelled sample. This evaluation was based on the
outputs of m = 4248 labelled objects, using for each object a
NN trained with the remaining 4247. We adopted set B of
input variables, which for threshold y_c = 0.1 yielded
completeness 96 ± 14 per cent and efficiency 62 ± 9 per cent.
The m NNs were applied to the unlabelled sample and the
resulting m values of y for each source, and their medians
y_med, are shown in Fig. 6. The number of unlabelled sources
with median outputs exceeding y_c = 0.1 is 58 (31 outside
and 27 within the DR5 spectroscopic area), and these are
our 'high-z QSO candidates'. Their properties are listed in
Table 5, including the median of the NN outputs. The same
parameter for the high-z QSOs in the labelled sample is
listed in Table 2 (but in this case the QSOs themselves were
used in the training).
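
The selection step described above (apply every trained network to each unlabelled source, take the median output, compare with the threshold) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the toy outputs are hypothetical, only five networks are used instead of m = 4248, and the comparison is written as y_med ≥ y_c, following the caption of Fig. 7.

```python
from statistics import median

def median_outputs(outputs_per_model):
    """Median NN output per source.

    outputs_per_model holds one row per trained network; each row
    lists that network's output y for every unlabelled source.
    """
    return [median(per_source) for per_source in zip(*outputs_per_model)]

def select_candidates(y_med, y_c=0.1):
    """Flag the sources whose median output reaches the threshold y_c."""
    return [y >= y_c for y in y_med]

# Toy outputs (hypothetical): 5 networks (rows), 4 sources (columns).
toy_outputs = [
    [0.02, 0.95, 0.08, 0.40],
    [0.01, 0.99, 0.12, 0.35],
    [0.03, 0.97, 0.09, 0.45],
    [0.00, 1.00, 0.30, 0.38],
    [0.02, 0.96, 0.07, 0.41],
]

y_med = median_outputs(toy_outputs)
is_candidate = select_candidates(y_med)
print(y_med)
print(is_candidate)
```

In the toy data the third source has one discrepant output (0.30) but a median below the threshold, which illustrates why a median over the networks is less sensitive to a single outlying network than a mean would be.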
Table 5. z ≥ 3.6 QSO candidates identified by the NN in the FIRST-SDSS DR5 unlabelled sample

RA           Dec          r_AB   S_1.4GHz y_med  z_NED   z_WHT   z_DR6   Notes                In DR6
(J2000)      (J2000)             (mJy)                                                        spec. area?
(1)          (2)          (3)    (4)      (5)    (6)     (6)     (6)     (7)                  (8)

07 25 18.26   37 05 18.3  19.61   26.56   1.00   4.33    -       -       -                    -
07 47 11.15   27 39 03.3  18.35    1.55   0.11   4.17    -       -       -                    yes
07 47 38.49   13 37 47.3  19.37    6.62   1.00   -       4.04    -       BAL                  -
08 07 10.74   13 17 39.4  20.01   48.88   0.56   -       3.70    3.726   -                    yes
08 15 55.01   46 53 21.4  19.88    3.73   0.14   -       -       -       -                    yes
08 23 23.32   15 52 06.8  19.31   79.33   1.00   -       3.79    3.781   -                    yes
08 33 16.91   29 22 28.0  20.13   12.73   0.49   -       -       -       -                    yes
08 43 23.69   16 56 56.1  19.66    2.36   0.20   -       -       -       DR6 unknown          yes
08 48 18.88   39 38 06.0  20.15    0.72   0.18   -       -       -       -                    yes
08 52 58.87   22 50 50.5  20.16   45.33   0.95   -       3.55    -       -                    yes
08 55 01.82   18 24 37.8  19.96    9.43   0.79   -       3.96    3.966   -                    yes
08 59 44.06   21 25 11.1  18.72   23.96   0.82   -       3.70    3.699   -                    yes
09 02 54.17   41 35 06.5  20.12    0.93   0.99   -       3.69    -       -                    yes
09 09 53.84   47 49 43.0  19.89  383.67   0.37   -       -       -       -                    yes
09 14 36.22   50 38 48.5  20.16   51.00   0.30   -       -       -       -                    yes
10 09 33.22   25 59 01.2  20.13    3.42   0.37   -       -       -       DR6 unknown          yes
10 10 20.85   28 51 50.1  20.11    2.62   0.28   -       -       -       DR6 z = 0.58 galaxy  yes
10 19 39.00   19 03 12.0  20.11    0.74   0.20   -       -       -       DR6 z = 0.45 galaxy  yes
10 29 40.93   10 04 10.9  19.47    3.22   0.25   -       -       -       -                    yes
10 34 20.43   41 49 37.5  20.12    1.98   0.50   -       -       -       -                    yes
10 52 25.06   25 15 41.3  20.10    5.26   0.16   -       -       3.404   -                    yes
10 58 07.47   03 30 59.6  19.92    4.62   0.15   -       -       -       -                    yes
11 05 43.86   25 53 43.1  20.09    2.76   0.98   -       -       3.779   -                    yes
11 09 46.44   19 02 57.6  20.04    6.95   1.00   -       -       -       -                    -
11 23 39.59   29 17 10.7  19.47    3.14   0.78   -       -       3.771   -                    yes
11 46 41.02   12 52 02.9  20.19    3.01   0.11   -       -       -       -                    yes
11 51 07.42   50 15 58.5  20.08    1.22   0.54   -       -       -       -                    yes
11 54 49.36   18 02 04.4  19.63   39.06   0.87   -       -       3.688   -                    yes
12 04 07.83   48 45 48.2  19.97    2.64   0.15   -       -       -       -                    yes
12 05 31.73   29 01 49.2  20.17    1.46   0.55   -       -       -       -                    yes
12 13 29.42  -03 27 25.7  19.64   25.53   0.47   -       -       -       -                    yes
12 20 27.96   26 19 03.5  18.12   35.04   0.94   3.694   -       3.697   -                    yes
12 21 35.64   28 06 13.8  19.77   28.76   0.11   3.305   -       3.288   -                    yes
12 28 19.96   47 40 30.4  19.32    2.22   0.59   -       -       -       -                    yes
12 31 28.22   18 47 14.3  19.41   11.17   0.13   -       -       -       -                    -
12 43 23.16   23 58 42.2  19.87   63.44   0.63   -       -       -       -                    -
12 44 43.06   06 09 34.6  19.78    1.36   0.20   -       -       -       -                    yes
13 12 42.86   08 41 05.0  18.53    3.93   0.43   -       3.73    3.743   -                    yes
13 20 53.63   10 37 51.5  19.46    8.43   0.23   -       3.42    3.431   -                    yes
13 22 27.58   53 52 09.2  19.68    2.51   0.10   -       1.23    -       -                    yes
13 37 59.43   36 34 20.6  20.17    2.88   0.12   -       1.07    -       -                    yes
13 42 01.42   05 01 56.0  20.11   27.24   0.14   -       3.17    -       -                    yes
13 48 54.37   17 11 49.6  19.13    1.90   0.99   -       3.60    -       -                    -
13 49 18.52   35 24 15.7  19.77   81.88   0.11   -       1.22    -       -                    yes
14 06 35.67   62 25 43.3  19.72   11.50   0.47   -       3.89    -       abs                  yes
14 34 13.05   16 28 52.7  19.86    4.95   1.00   -       4.21    -       -                    -
14 53 29.01   48 17 24.9  20.11    3.75   1.00   -       3.77    -       -                    yes
15 11 46.99   25 24 24.3  19.95    1.39   1.00   -       -       3.719   BAL                  yes
15 20 28.14   18 35 56.1  19.82    6.94   0.96   -       4.11    -       -                    -
15 24 24.35   07 31 29.9  20.13    1.51   0.39   -       3.59    -       -                    yes
15 33 36.14   05 43 56.5  19.84   28.29   0.99   -       3.93    -       -                    -
15 37 56.90   48 23 32.3  20.00    3.07   0.97   -       1.34    -       -                    yes
16 37 08.29   09 14 24.6  19.56    9.43   0.93   -       3.75    -       -                    -
17 02 53.54   23 57 58.0  19.74   19.24   0.25   -       -       -       unknown              yes
17 20 02.17   24 55 48.8  19.82   12.90   0.16   -       3.34    -       BAL                  -
22 28 14.39  -08 55 25.7  20.19    1.99   0.99   -       3.64    -       -                    yes
22 35 35.59   00 36 02.0  20.14    4.32   0.98   -       3.87    -       -                    yes
23 50 22.39  -09 51 44.3  19.68    6.51   0.10   -       3.70    -       -                    yes

The columns give: (1,2,3,4) similar to Table 2; (5) NN output (see Section 4.2); (6) redshifts from NED
(SDSS 072518.26+370518.3, SDSS 074711.15+273903.3 and SDSS 122027.96+261903.5 from Benn et al. 2002; SDSS
122135.64+280613.8 from Mason et al. 2000), from this work (WHT) or from DR6; (7) BAL = broad absorption line QSO; abs =
the Lyα line appears to be self-absorbed; (8) indicates the sources located within the spectroscopic plates available in DR6.
A dash marks an empty entry.






4.3 Spectroscopic check of the QSO selection 

The quality of the selection of high-z QSO candidates was 
tested by comparison with spectroscopy from the NED 
database, from a dedicated observing programme with ISIS 
at the William Herschel Telescope, and from spectroscopic 
classifications from SDSS DR6. The resulting classifications 
and redshifts are included in Table 5. 



4.3.1 Spectroscopic classifications from NED

Counterparts of the 58 high-z QSO candidates and the 4357
non-candidates (4415 unlabelled sources) were sought in
NED in February 2007, using a search radius of 5 arcsec.
A similar radius is used in the SDSS DR5 Quasar Catalog
(Schneider et al. 2007) to quote the association with a NED
object. Four of the candidates were spectroscopically
classified, all QSOs with z > 3.3. Benn et al. (2002) identified
three of them, with redshifts 4.33, 4.17 and 3.694; the
remaining candidate, with z = 3.305, was identified by Mason
et al. (2000).

Amongst the non-candidates, 382 were associated with
QSOs (blazars excluded) with secure redshifts, and none of
them had z ≥ 3.6. Another 13 QSOs had ambiguous or
uncertain redshifts but none were consistent with z ≥ 3.6.



4.3.2 Spectroscopy with ISIS

Spectra of 27 of the remaining 54 candidates were obtained
with the WHT's ISIS dual-arm spectrograph in two runs
on 2007 April 3, 4, 6 and 7 and July 8, 9 and 10. The
R158R grating was used on the red arm, yielding a
wavelength range 5300-10000 Å with dispersion 1.8 Å pixel^-1.
On the blue arm the R300B grating was used, giving a
spectral range 3000-6000 Å with dispersion 0.9 Å pixel^-1.
Exposure times were 600 s. Spectrophotometric standard stars
were observed in order to calibrate the instrumental
spectral response. Seeing was typically better than 1 arcsec and
the slit width was set to 1 arcsec, yielding spectral
resolution, as measured from sky lines, of 7.7 Å and 4.5 Å in the
red and blue arms respectively. Standard data reduction was
carried out using the IRAF^1 package. Arc-lamp exposures
were used for wavelength calibration, giving solutions with
residuals < 0.1 Å.

We observed 15 sources in April, prioritising the
candidates with higher NN output y, brighter r magnitude, and
at lower air mass, regardless of their location inside or
outside the DR5 spectroscopic area. All 15 sources were
classified as QSOs and their redshifts were determined as the
average of the values measured from individual emission-
line centroids, usually Lyα, Si IV, and C IV. Thirteen of the
QSOs have 3.60 ≤ z ≤ 4.21, and the remaining two z = 3.55
and z = 3.42.
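
The redshift determination described above (an average over the values implied by individual emission-line centroids) amounts to z = mean(λ_obs/λ_rest - 1). The sketch below is illustrative only: the rest wavelengths are approximate values assumed here, the observed centroids are hypothetical numbers invented for the example, and the paper does not specify whether any weighting was applied.

```python
# Approximate rest-frame wavelengths in Angstrom (assumed values).
REST_WAVELENGTHS = {
    "Lya": 1215.67,
    "Si IV": 1397.61,
    "C IV": 1549.48,
}

def redshift_from_lines(observed_centroids):
    """Unweighted mean of z = lambda_obs / lambda_rest - 1 over the lines."""
    values = [
        observed / REST_WAVELENGTHS[line] - 1.0
        for line, observed in observed_centroids.items()
    ]
    return sum(values) / len(values)

# Hypothetical observed centroids (Angstrom) for a QSO near z = 3.8.
toy_centroids = {"Lya": 5835.0, "Si IV": 6707.0, "C IV": 7439.0}

z_mean = redshift_from_lines(toy_centroids)
print("z = %.2f" % z_mean)
```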

The 12 sources observed in July consist of all the
remaining candidates with right ascension in the range 13-24
hours and without spectra in DR6. The results of the
spectroscopic classification were as follows. Eleven sources
were classified as QSOs: four with 3.6 ≤ z ≤ 3.9, four with
1.07 ≤ z ≤ 1.34 and three with z = 3.17, z = 3.34 and
z = 3.59. One candidate remained unclassified due to both
low signal-to-noise and lack of clear absorption or emission
features in the spectrum.

^1 IRAF is distributed by the National Optical Astronomy
Observatories, which are operated by the Association of
Universities for Research in Astronomy, Inc., under cooperative
agreement with the National Science Foundation.

Fig. 7 shows the spectra of the 21 candidates identified
with ISIS as z > 3.2 QSOs (17 of them with z ≥ 3.6).

4.3.3 Spectroscopy from SDSS DR6

SDSS DR6 (CAS SpecObj view) includes spectra of nine of
the remaining 27 candidates (and also of six of those already
observed with ISIS in April, and two from the NED). These
nine include four QSOs with 3.6 ≤ z ≤ 3.8, one at z = 3.40,
two galaxies with z = 0.45 and z = 0.58, and two sources
labelled as unknown.

Amongst the 4357 non-candidates, 898 have spectra
from DR6, and are classified as stars or galaxies (71),
unknown (62) or QSOs (765). The latter include some
of the QSOs with spectra from NED cited in Section
4.3.1. Two QSOs have z ≥ 3.6, but their SDSS
redshifts are incorrect (see the DR6 spectra in Fig. 8). For
SDSS 075558.88-m3210.9, with quoted z = 4.340
(confidence=0.51), we measured from the emission lines of
Mg II (erroneously taken as Lyα by the SDSS pipelines)
and C III] a redshift z = 1.295. The spectrum of SDSS
161836.09+153313.5, with z = 4.376 (confidence=0.00),
resembles that found for sources SDSS 130941.36-M12540.1
and SDSS 153420.23+413007.5 (see Section 2), classified in
DR5 as QSOs with redshifts z = 4.395 and z = 4.426, but
with revised values in DR5Q of z = 1.362 and z = 1.400,
due to the re-interpretation of the assumed Lyα emission
line as Mg II. A similar revision gives z = 1.322 for SDSS
161836.09+153313.5.



4.3.4 Performance of the selection of high-redshift QSOs

Spectra are available for 40 candidates, obtained from NED,
SDSS DR6, or observed for this work, and 24 of them are
confirmed z ≥ 3.6 QSOs.

Figure 9 shows a plot of the NN output y_med versus
r magnitude for the 58 high-z QSO candidates. Different
symbols correspond to the different spectral classes (high-
z QSOs are shown as diamonds). The two panels separate
the candidates located within and outside the spectroscopic
area available in DR6 (as noted in the last column of Table
5). Fig. 9 shows a larger fraction of high-z QSOs among the
candidates with higher NN outputs, a trend also evident in
Fig. 2. The efficiency for 0.55 ≤ y ≤ 1 is 91 per cent (20
z ≥ 3.6 QSOs out of 22 candidates with available spectra),
dropping to 22 per cent (four of 18) for 0.1 ≤ y < 0.55.
Spectra are available for all 20 high-z QSO candidates with
13h < RA < 24h, and this set therefore forms a complete
subsample with regard to the distribution of NN outputs.
With 12 confirmed z ≥ 3.6 QSOs out of 20 candidates, the
efficiency from this sample is 60 ± 17 per cent. This value
is in good agreement with the efficiency 24/40 = 60 ± 12
per cent for the total sample of candidates with spectra,
and with the 62 ± 9 per cent efficiency obtained within the
labelled sample (Section 3).

Figure 7. WHT spectra of 21 of the 58 NN y_med ≥ 0.1 candidates (six also included in DR6), including 17 z ≥ 3.6 QSOs, and four QSOs
at z = 3.34, z = 3.42, z = 3.55 and z = 3.59. Emission features are labelled by ion. Symbol ⊕ indicates the position of the atmospheric
absorption band O2 A.

Figure 8. DR6 spectra of two non-candidates with revised redshifts z = 1.29 and z = 1.32, listed in SDSS-DR6 as z ≥ 3.6 QSOs due to
misidentification between the Lyα and Mg II emission lines.

Figure 9. NN output y versus r magnitude for the 58 high-z QSO candidates, located either within (left panel) or outside (right panel)
the spectroscopic area available in DR6. ◇: QSOs with z ≥ 3.6, o: QSOs with 3.2 ≤ z < 3.6, +: other classification or unknown, x:
without spectrum.



None of the non-candidates in the unlabelled sample
with available spectrum from NED or DR6 is a z ≥ 3.6
QSO, therefore we have no evidence of incompleteness with
respect to DR5 unlabelled sources with matches in any of
these databases. In fact a good completeness was expected
for the matches with the DR6 spectroscopic survey, since
the selection of sources, classification and measured
properties follow the same scheme as for the DR5
spectroscopic catalogue used for the training, where we had found
96 per cent completeness. However, the NED database
provides spectroscopic identifications assigned from other
surveys, and the absence of NED high-z QSOs among the
non-candidates therefore gives independent evidence that the
classifier has a good completeness in its extension to DR5
unlabelled sources. Three DR5 unlabelled sources are
identified as z ≥ 3.6 QSOs in NED, all of them selected as high-z
QSO candidates in our work.



5 DISCUSSION AND CONCLUSIONS 

In this paper we aimed to obtain a sample of z ≥ 3.6 QSOs,
starting from an initial complete sample of 8665 FIRST
sources with star-like counterparts in the SDSS DR5
photometric survey, of which 4250 have spectra in DR5, 52 of them
being z ≥ 3.6 QSOs. We found that simple supervised NNs,
trained on the sources with DR5 spectra, and using optical
photometry and radio data as input parameters, allow
separation of high-z QSOs from the remaining spectral classes
with 96 per cent completeness and 62 per cent efficiency.
The application of the NNs to the sample of 4415 sources
without DR5 spectra yielded 58 high-z QSO candidates.

We obtained spectra of 27 of the 58 candidates, and
17 were confirmed as high-z QSOs. Spectra of 13 additional
candidates from the NED and from DR6 revealed seven more
high-z QSOs, yielding a total of 40 candidates with spectra
available, of which 24 are high-z QSOs. The number of
high-z QSOs was thus increased from 52 in the initial sample to
76 (a factor 1.46).

The overall efficiency in the selection of new high-z
QSOs is 60 ± 12 per cent (24/40). The estimate from a
subsample unbiased with respect to the NN outputs is 60 ± 17
per cent (12/20), and both values are in good agreement
with the 62 ± 9 per cent efficiency obtained for the DR5
labelled sample (Section 3).

None of the non-candidates with spectra available from
NED or DR6 is a z ≥ 3.6 QSO, therefore we have no evidence
of incompleteness regarding high-z QSOs with matches in
these catalogues. Since the NED spectroscopic
identifications are assigned from a variety of surveys other than
SDSS, the database provides a blind test of the good
completeness of the classifier for DR5 unlabelled sources.

The efficiencies found are well above the values obtained
for previous samples of RL high-z QSOs, based on less
accurate optical photometry and with fewer wavelength bands
than SDSS, although already highly complete (≥ 95 per
cent) regarding optical colour selection. The efficiencies for
various samples, summarised by Carballo et al. (2006) for
z ≥ 3.7, are 12-13 per cent (Holt et al. 2004, Carballo et
al. 2006; S_1.4GHz > 1 mJy, APM POSS E, O), 6 per cent
(Hook et al. 2002; S_5GHz > 30 mJy and radio flat, APM
POSS E, O), and 19 per cent (Snellen et al. 2001; S_5GHz >
30 mJy and radio flat, APM UKST B, R, I).

Adopting for the 18 candidates which still lack
spectroscopy a weighted efficiency of 37 per cent (four candidates
with y ≥ 0.55 and 14 with y < 0.55, with expected
efficiencies of 91 and 22 per cent respectively), we calculate ~ 7
additional z ≥ 3.6 QSOs. The FIRST-DR5 sample of high-z
QSOs is thus expected to contain ~ 83 QSOs (52+24+7).
Adopting as a lower limit for completeness the nominal value
of 96 per cent found for the labelled sample, we calculate
for the set of 31 high-z QSOs obtained by the classifier (24
discovered and 7 predicted) a minimum of ~ 1 missed high-z
QSO.
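
The arithmetic behind these estimates can be written out explicitly. The sketch below uses only counts quoted in the text (20/22 and 4/18 for the two NN-output bins, four and 14 candidates without spectra, 96 per cent completeness); the function and variable names are hypothetical, and rounding to whole objects is an assumption about how the quoted values were obtained.

```python
def expected_confirmations(n_high, n_low, eff_high, eff_low):
    """Expected number of true high-z QSOs and the weighted efficiency."""
    expected = n_high * eff_high + n_low * eff_low
    weighted_efficiency = expected / (n_high + n_low)
    return expected, weighted_efficiency

# Efficiencies measured for candidates with spectra (Section 4.3.4).
eff_high = 20 / 22  # 0.55 <= y <= 1
eff_low = 4 / 18    # 0.1 <= y < 0.55

# Candidates still lacking spectroscopy: 4 with high y, 14 with low y.
expected, weighted = expected_confirmations(4, 14, eff_high, eff_low)

# Labelled + newly confirmed + predicted high-z QSOs.
total_high_z = 52 + 24 + round(expected)

# QSOs missed by the classifier if its completeness is 96 per cent.
completeness = 0.96
n_selected = 24 + round(expected)
missed = n_selected / completeness - n_selected

print("weighted efficiency: %.0f per cent" % (100 * weighted))
print("expected additional high-z QSOs: %.1f" % expected)
print("expected total: %d, missed: %.1f" % (total_high_z, missed))
```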

The NNs found 31 contaminants, i.e. non high-z QSOs
erroneously selected as high-z QSOs, in the labelled
sample, with a fraction 61 ± 14 per cent (19/31) being QSOs
with 3.2 ≤ z ≤ 3.6. Among the 40 high-z QSO candidates
with available spectra we found 16 contaminants, seven of
them being QSOs with 3.15 ≤ z ≤ 3.6, confirming a high
rate of QSOs with z near the threshold z = 3.6 among the
contaminants, 7/16 = 44 ± 17 per cent.

Our results allow us to obtain an estimate of the
incompleteness of SDSS for the spectroscopic classification of
FIRST high-z QSOs. 47 of the high-z QSO candidates are
located in the spectroscopic area covered by DR6 (Table 5,
Fig. 9 left panel), and 17 of them are z ≥ 3.6 QSOs, ten
included in the DR6 spectroscopic catalogue and seven not
included. 15 candidates in this area still lack spectroscopy,
and assuming for them a weighted efficiency of 31 per cent
(two candidates with y ≥ 0.55 and 13 with y < 0.55), we
expect another four high-z QSOs. From this calculation we
estimate 11 FIRST high-z QSOs missed by SDSS (7 QSOs and
4 candidates), which when compared to 62 (52+10)
identifications yields an incompleteness of SDSS for the
spectroscopic classification of FIRST 3.6 ≤ z ≤ 4.6 QSOs of ~ 15
per cent (11/73) for r ≤ 20.2.
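
The incompleteness quoted above is the ratio of missed objects to the sum of missed and identified objects. A minimal sketch using the counts given in the text follows; the function and variable names are hypothetical.

```python
def sdss_incompleteness(n_missed, n_identified):
    """Fraction of high-z QSOs lacking an SDSS spectroscopic classification."""
    return n_missed / (n_missed + n_identified)

# Counts quoted in the text, within the DR6 spectroscopic area.
confirmed_not_in_dr6 = 7        # confirmed high-z QSOs absent from DR6
expected_among_unobserved = 4   # predicted among 15 candidates without spectra
identified_by_sdss = 52 + 10    # labelled sample plus new DR6 identifications

n_missed = confirmed_not_in_dr6 + expected_among_unobserved
fraction = sdss_incompleteness(n_missed, identified_by_sdss)

print("%d/%d = %.0f per cent"
      % (n_missed, n_missed + identified_by_sdss, 100 * fraction))
```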

The definition of the original sample of 52 high-z
FIRST QSOs excluded lobe-dominated morphologies and
"narrow-lined" QSOs, and included QSOs with BALs or
self-absorbed at Lyα. The same conditions hold for the larger
sample of 76 QSOs, obtained from the application of the NNs
trained using these 52 objects to DR5 photometric sources
without spectra in DR5, and covering a slightly wider region
of input space than that used by SDSS for QSO targeting.

In a future paper we plan to analyze the optical
luminosity function of FIRST-SDSS QSOs at 3.6 ≤ z ≤ 4.6 on
the basis of this sample. Concurrently we expect to carry
out spectroscopic observations of the 18 candidates without
spectra. Given the efficacy of our approach, we intend to
extend the sample using more updated SDSS data releases,
increasing the number of training sources and the number of
high-z QSO candidates, for which subsequent spectroscopy
will be planned. We envisage using additional infrared data
via UKIDSS (UKIRT Infrared Deep Sky Survey).